Abstract



We present the design and analysis of a near linear-work parallel algorithm for solving symmetric diagonally
dominant (SDD) linear systems. On input of an SDD n-by-n matrix A with m non-zero entries and a
vector b, our algorithm computes a vector $x$ such that $\|x - A^{+}b\|_A \le \varepsilon \cdot \|A^{+}b\|_A$ in
$O(m \log^{O(1)} n \log\frac{1}{\varepsilon})$ work and $O(m^{1/3+\theta} \log\frac{1}{\varepsilon})$ depth for any fixed $\theta > 0$.

The algorithm relies on a parallel algorithm for generating low-stretch spanning trees or spanning subgraphs.
To this end, we first develop a parallel decomposition algorithm that, in polylogarithmic depth and
$\tilde{O}(|E|)$ work,^1 partitions a graph into components with polylogarithmic diameter such that only a small
fraction of the original edges are between the components. This can be used to generate low-stretch spanning
trees with average stretch $O(n^{\alpha})$ in $O(n^{1+\alpha})$ work and $O(n^{\alpha})$ depth. Alternatively, it can be used to
generate spanning subgraphs with polylogarithmic average stretch in $\tilde{O}(|E|)$ work and polylogarithmic depth.
We apply this subgraph construction to derive a parallel linear system solver. By using this solver in known
applications, our results imply improved parallel randomized algorithms for several problems, including
single-source shortest paths, maximum flow, minimum-cost flow, and approximate maximum flow.



1 Introduction 

Solving a system of linear equations $Ax = b$ is a fundamental computing primitive that lies at the core of many
numerical and scientific computing algorithms, including the popular interior-point algorithms. The special case
of symmetric diagonally dominant (SDD) systems has seen substantial progress in recent years; in particular,
the ground-breaking work of Spielman and Teng showed how to solve SDD systems to accuracy $\varepsilon$ in time
$\tilde{O}(m \log\frac{1}{\varepsilon})$, where m is the number of non-zeros in the n-by-n matrix A.^2 This is algorithmically significant
since solving SDD systems has implications for computing eigenvectors, solving flow problems, finding graph
sparsifiers, and problems in vision and graphics (see [Spi10, Ten10] for these and other applications).

In the sequential setting, the current best SDD solvers run in $O(m \log n (\log\log n)^2 \log\frac{1}{\varepsilon})$ time [KMP11].
However, with the exception of the special case of planar SDD systems [KM07], we know of no previous



*Contact Address: 5000 Forbes Ave. Computer Science Department. Pittsburgh, PA 15213. E-mail: {guyb, anupamg, 
^1 The $\tilde{O}(\cdot)$ notation hides polylogarithmic factors.

^2 The Spielman-Teng solver and all subsequent improvements are randomized algorithms. As a consequence, all algorithms relying
on the solvers are also randomized. For simplicity, we omit standard complexity factors related to the probability of error.
parallel SDD solvers that perform near-linear^3 work and achieve non-trivial parallelism. This raises a natural
question: Is it possible to solve an SDD linear system in $o(n)$ depth and $\tilde{O}(m)$ work? This work answers this
question affirmatively:

Theorem 1.1 For any fixed $\theta > 0$ and any $\varepsilon > 0$, there is an algorithm SDDSolve that on input an $n \times n$ SDD
matrix A with m non-zero elements and a vector b, computes a vector $x$ such that $\|x - A^{+}b\|_A \le \varepsilon \cdot \|A^{+}b\|_A$
in $O(m \log^{O(1)} n \log\frac{1}{\varepsilon})$ work and $O(m^{1/3+\theta} \log\frac{1}{\varepsilon})$ depth.

In the process of developing this algorithm, we give parallel algorithms for constructing graph decompositions
with strong-diameter guarantees, and parallel algorithms to construct low-stretch spanning trees and
low-stretch ultra-sparse subgraphs, which may be of independent interest. An overview of these algorithms and
their underlying techniques is given in Section 3.

Some Applications. Let us mention some of the implications of Theorem 1.1, obtained by plugging it into 
known reductions. 

- Construction of Spectral Sparsifiers. Spielman and Srivastava [SS08] showed that spectral sparsifiers can
be constructed using $O(\log n)$ Laplacian solves, and using our theorem we get spectral and cut sparsifiers in
$\tilde{O}(m^{1/3+\theta})$ depth and $\tilde{O}(m)$ work.

- Flow Problems. Daitch and Spielman [DS08] showed that various graph optimization problems, such as
max-flow, min-cost flow, and lossy flow problems, can be reduced to $\tilde{O}(m^{1/2})$ applications^4 of SDD solves
via interior point methods described in [Ye97, Ren01, BV04]. Combining this with our main theorem implies
that these algorithms can be parallelized to run in $\tilde{O}(m^{5/6+\theta})$ depth and $\tilde{O}(m^{3/2})$ work. This gives the first
parallel algorithm with $o(n)$ depth which is work-efficient to within polylog(n) factors relative to the sequential
algorithm for all problems analyzed in [DS08]. In some sense, the parallel bounds are more interesting than
the sequential times because in many cases the results in [DS08] are not the best known sequentially (e.g.,
max-flow), but do lead to the best known parallel bounds for problems that have traditionally been hard to
parallelize. Finally, we note that although [DS08] does not explicitly analyze shortest path, their analysis
naturally generalizes to the LP for it.

Our algorithm can also be applied in the inner loop of [CKM+10], yielding a $\tilde{O}(m^{5/6+\theta} \mathrm{poly}(\varepsilon^{-1}))$ depth
and $\tilde{O}(m^{3/2} \mathrm{poly}(\varepsilon^{-1}))$ work algorithm for finding $(1 - \varepsilon)$-approximate maximum flows and $(1 + \varepsilon)$-approximate
minimum cuts in undirected graphs.



2 Preliminaries and Notation 

We use the notation $\tilde{O}(f(n))$ to mean $O(f(n) \, \mathrm{polylog}(f(n)))$. We use $A \uplus B$ to denote disjoint unions, and $[k]$
to denote the set $\{1, 2, \ldots, k\}$. Given a graph $G = (V, E)$, let $\mathrm{dist}(u, v)$ denote the edge-count distance (or hop
distance) between u and v, ignoring the edge lengths. When the graph has edge lengths $w(e)$ (also denoted by
$w_e$), let $d_G(u, v)$ denote the edge-length distance, i.e., the length of the shortest path (according to these edge lengths) between
u and v. If the graph has unit edge lengths, the two definitions coincide. We drop subscripts when the context
is clear. We denote by $V(G)$ and $E(G)$, respectively, the set of nodes and the set of edges, and use $n = |V(G)|$

^3 I.e., linear up to polylog factors.

^4 Here $\tilde{O}$ hides $\log U$ factors as well, where it is assumed that the edge weights are integers in the range $[1 \ldots U]$.

and $m = |E(G)|$. For an edge $e = \{u, v\}$, the stretch of e on G' is $\mathrm{str}_{G'}(e) = d_{G'}(u, v)/w(e)$. The total
stretch of $G = (V, E, w)$ with respect to G' is $\mathrm{str}_{G'}(E(G)) = \sum_{e \in E(G)} \mathrm{str}_{G'}(e)$.
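
As a small illustration of these definitions, the following Python sketch computes the total stretch of a graph with respect to a subgraph G'. The edge-list representation and the function names (`build_adj`, `shortest_dist`, `total_stretch`) are assumptions made for this example and do not come from the paper; distances are computed with a plain sequential Dijkstra search.

```python
import heapq


def build_adj(edges):
    """Adjacency dict {u: [(v, w), ...]} of an undirected weighted edge list."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    return adj


def shortest_dist(adj, src, dst):
    """Edge-length distance d(src, dst) by Dijkstra; inf if unreachable."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")


def total_stretch(edges, sub_edges):
    """Sum over e = {u, v} in `edges` of d_{G'}(u, v) / w(e), with G' = `sub_edges`."""
    adj = build_adj(sub_edges)
    return sum(shortest_dist(adj, u, v) / w for u, v, w in edges)


# Unit-weight 4-cycle; G' is the spanning path 0-1-2-3.
cycle = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (3, 0, 1.0)]
path = cycle[:3]
print(total_stretch(cycle, path))  # three tree edges of stretch 1, edge {3, 0} of stretch 3
```

In this example the only non-tree edge {3, 0} is routed along the three tree edges, so it contributes stretch 3 while every tree edge contributes stretch 1.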

Given $G = (V, E)$, a distance function $\delta$ (which is either dist or d), and a partition of V into
$C_1 \uplus C_2 \uplus \cdots \uplus C_p$, let $G[C_i]$ denote the induced subgraph on set $C_i$. The weak diameter of $C_i$ is $\max_{u,v \in C_i} \delta_G(u, v)$,
whereas the strong diameter of $C_i$ is $\max_{u,v \in C_i} \delta_{G[C_i]}(u, v)$; the former measures distances in the original
graph whereas the latter measures distances within the induced subgraph. The strong (or weak) diameter of the
partition is the maximum strong (or weak) diameter over all the components $C_i$.

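Graph Laplacians. For a fixed, but arbitrary, numbering of the nodes and edges in a graph $G = (V, E)$, the
Laplacian $L_G$ of G is the $|V|$-by-$|V|$ matrix given by

$$L_G(i, j) = \begin{cases} -w_{ij} & \text{if } i \neq j, \\ \sum_{\{i,k\} \in E(G)} w_{ik} & \text{if } i = j. \end{cases}$$

When the context is clear, we use G and $L_G$ interchangeably. Given two graphs G and H and a scalar $\mu \in \mathbb{R}$,
we say $G \preceq \mu H$ if $\mu L_H - L_G$ is positive semidefinite, or equivalently $x^{\top} L_G x \le \mu \, x^{\top} L_H x$ for all vectors
$x \in \mathbb{R}^{|V|}$.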

Matrix Norms, SDD Matrices. For a matrix A, we denote by $A^{+}$ the Moore-Penrose pseudoinverse of A (i.e.,
$A^{+}$ has the same null space as A and acts as the inverse of A on its image). Given a symmetric positive
semidefinite matrix A, the A-norm of a vector x is defined as $\|x\|_A = \sqrt{x^{\top} A x}$. A matrix A is symmetric diagonally
dominant (SDD) if it is symmetric and for all i, $A_{i,i} \ge \sum_{j \neq i} |A_{i,j}|$. Solving an SDD system reduces in $O(m)$
work and $O(\log^{O(1)} m)$ depth to solving a graph Laplacian (a subclass of SDD matrices corresponding to
undirected weighted graphs) [Gre96, Section 7.1].
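
To make the definitions of the Laplacian, the SDD property, and the A-norm concrete, here is a minimal Python sketch using dense lists of lists. The helper names (`laplacian`, `is_sdd`, `a_norm`) and the dense representation are assumptions chosen for readability; a practical solver would use a sparse format.

```python
import math


def laplacian(n, edges):
    """Dense Laplacian of an undirected weighted graph on nodes 0..n-1 (no self-loops)."""
    L = [[0.0] * n for _ in range(n)]
    for i, j, w in edges:
        L[i][j] -= w
        L[j][i] -= w
        L[i][i] += w
        L[j][j] += w
    return L


def is_sdd(A):
    """Symmetric, and each diagonal entry dominates the absolute off-diagonal row sum."""
    n = len(A)
    symmetric = all(A[i][j] == A[j][i] for i in range(n) for j in range(n))
    dominant = all(
        A[i][i] >= sum(abs(A[i][j]) for j in range(n) if j != i) for i in range(n)
    )
    return symmetric and dominant


def a_norm(A, x):
    """||x||_A = sqrt(x^T A x) for a symmetric positive semidefinite A."""
    n = len(A)
    quad = sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))
    return math.sqrt(max(quad, 0.0))  # clamp tiny negative rounding error


# Weighted triangle: a Laplacian is always SDD, and the all-ones vector is in its null space.
L = laplacian(3, [(0, 1, 2.0), (1, 2, 1.0), (0, 2, 3.0)])
print(is_sdd(L), a_norm(L, [1.0, 1.0, 1.0]), a_norm(L, [1.0, 0.0, 0.0]))
```

The second printed value is zero because constant vectors lie in the null space of a Laplacian, which is why the pseudoinverse $A^{+}$ rather than an ordinary inverse appears in the error guarantee.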

Parallel Models. We analyze algorithms in the standard PRAM model, focusing on the work and depth param- 
eters of the algorithms. By work, we mean the total operation count — and by depth, we mean the longest chain 
of dependencies (i.e., parallel time in PRAM). 

Parallel Ball Growing. Let $B_G(s, r)$ denote the ball of edge-count distance r from a source s, i.e., $B_G(s, r) =
\{v \in V(G) : \mathrm{dist}_G(s, v) \le r\}$. We rely on an elementary form of parallel breadth-first search to compute
$B_G(s, r)$. The algorithm visits the nodes level by level as they are encountered in the BFS order. More precisely,
level 0 contains only the source node s, level 1 contains the neighbors of s, and each subsequent level $i + 1$
contains the neighbors of level i's nodes that have not shown up in a previous level. On standard parallel
models (e.g., CRCW), this can be computed in $O(r \log n)$ depth and $O(m' + n')$ work, where m' and n' are
the total numbers of edges and nodes, respectively, encountered in the search [UY91, KS97]. Notice that we
could achieve this runtime bound with a variety of graph (matrix) representations, e.g., using the compressed
sparse-row (CSR) format. Our applications apply ball growing on r roughly $O(\log^{O(1)} n)$, resulting in a small
depth bound. We remark that the idea of small-radius parallel ball growing has previously been employed in
the context of approximate shortest paths (see, e.g., [UY91, KS97, Coh00]). There is an alternative approach
of repeatedly squaring a matrix, which can give a better depth bound for large r at the expense of a much larger
work bound (about $n^3$).
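
The level-by-level search described above can be sketched as follows. This is a sequential Python rendering under the assumption that the graph is given as an adjacency dict; the function name `grow_ball` is hypothetical. The loop over the frontier is the part that a PRAM implementation would execute in parallel, one level per round.

```python
def grow_ball(adj, source, radius):
    """Return {node: level} for every node within `radius` hops of `source`."""
    level = {source: 0}
    frontier = [source]
    for depth in range(1, radius + 1):
        next_frontier = []
        for u in frontier:  # conceptually parallel: all frontier nodes expand at once
            for v in adj[u]:
                if v not in level:  # first time v is seen, so its level is `depth`
                    level[v] = depth
                    next_frontier.append(v)
        if not next_frontier:
            break  # the whole connected component has been visited
        frontier = next_frontier
    return level


# Path 0-1-2-3-4: the ball of radius 2 around node 0 stops at node 2.
path_adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(grow_ball(path_adj, 0, 2))
```

Only the edges incident to visited nodes are touched, which is the source of the $O(m' + n')$ work bound; the number of rounds is at most r.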

Finally, we state a tail bound which will be useful in our analysis. This bound is easily derived from
well-known facts about the tail of a hypergeometric random variable [Chv79, Hoe63, Ska09].

Lemma 2.1 (Hypergeometric Tail Bound) Let H be a hypergeometric random variable denoting the number
of red balls found in a sample of n balls drawn from a total of N balls of which M are red. Then, if $\mu =
E[H] = nM/N$, then

$$\Pr[H \ge 2\mu] \le e^{-\mu/4}.$$

Proof: We apply the following theorem of Hoeffding [Chv79, Hoe63, Ska09]. For any $t > 0$,

$$\Pr[H \ge \mu + tn] \le \left( \left(\frac{p}{p+t}\right)^{p+t} \left(\frac{1-p}{1-p-t}\right)^{1-p-t} \right)^{n},$$

where $p = \mu/n$. Using $t = p$, we have

$$\Pr[H \ge 2\mu] \le \left( \left(\frac{p}{2p}\right)^{2p} \left(\frac{1-p}{1-2p}\right)^{1-2p} \right)^{n} = \left( e^{-p \ln 4} \left(1 + \frac{p}{1-2p}\right)^{1-2p} \right)^{n} \le \left( e^{-p \ln 4} \cdot e^{p} \right)^{n} \le e^{-pn/4} = e^{-\mu/4},$$

where we have used the fact that $1 + x \le \exp(x)$.
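
As a numerical sanity check of Lemma 2.1 (not part of the proof), the following Python sketch computes the exact upper tail of a hypergeometric variable and compares it with $e^{-\mu/4}$. The parameter values and the function name `hypergeom_upper_tail` are illustrative assumptions, not values used in the paper.

```python
from math import comb, exp


def hypergeom_upper_tail(N, M, n, k):
    """Exact Pr[H >= k] for n draws without replacement from N balls, M of them red."""
    total = comb(N, n)
    hits = sum(comb(M, h) * comb(N - M, n - h) for h in range(k, min(n, M) + 1))
    return hits / total


N, M, n = 200, 40, 50      # illustrative parameters only
mu = n * M / N             # expected number of red balls in the sample
tail = hypergeom_upper_tail(N, M, n, int(2 * mu))
bound = exp(-mu / 4)
print(tail, bound)         # the exact tail should not exceed the bound of Lemma 2.1
```

For these parameters the exact tail is far below the bound, which reflects that the lemma trades tightness for a simple closed form.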



3 Overview of Our Techniques 



In the general solver framework of Spielman and Teng [ST06, KMP10], near linear-time SDD solvers rely
on a suitable preconditioning chain of progressively smaller graphs. Assuming that we have an algorithm
for generating low-stretch spanning trees, the algorithm as given in [KMP10] parallelizes under the following
modifications: (i) perform the partial Cholesky factorization in parallel and (ii) terminate the preconditioning
chain with a graph that is of size approximately $m^{1/3}$. The details in Section 6 are the primary motivation of
the main technical part of the work in this chapter, a parallel implementation of a modified version of Alon et
al.'s low-stretch spanning tree algorithm [AKPW95].

More specifically, as a first step, we find an algorithm to embed a graph into a spanning tree with average
stretch $2^{O(\sqrt{\log n \log\log n})}$ in $\tilde{O}(m)$ work and $O(2^{O(\sqrt{\log n \log\log n})} \log \Delta)$ depth, where $\Delta$ is the ratio of the
largest to smallest distance in the graph. The original AKPW algorithm relies on a parallel graph decomposition
scheme of Awerbuch [Awe85], which takes an unweighted graph and breaks it into components with a
specified diameter and few crossing edges. While such schemes are known in the sequential setting, they do not
parallelize readily because removing edges belonging to one component might increase the diameter or even
disconnect subsequent components. We present the first near linear-work parallel decomposition algorithm that
also gives strong-diameter guarantees, in Section 4, and the tree embedding results in Section 5.1.

Ideally, we would have liked for our spanning trees to have a polylogarithmic stretch, computable by a
polylogarithmic depth, near linear-work algorithm. However, for our solvers, we make the additional observation
that we do not really need a spanning tree with small stretch; it suffices to give an "ultra-sparse" graph with
small stretch, one that has only $O(m/\mathrm{polylog}(n))$ edges more than a tree. Hence, we present a parallel
algorithm in Section 5.2 which outputs an ultra-sparse graph with $O(\mathrm{polylog}(n))$ average stretch, performing $\tilde{O}(m)$
work with $O(\mathrm{polylog}(n))$ depth. Note that this removes the dependence on $\log \Delta$ in the depth, and reduces both
the stretch and the depth from $2^{O(\sqrt{\log n \log\log n})}$ to $O(\mathrm{polylog}(n))$.^5 When combined with the aforementioned
routines for constructing an SDD solver presented in Section 6, this low-stretch spanning subgraph construction
yields a parallel solver algorithm.

4 Parallel Low-Diameter Decomposition 

In this section, we present a parallel algorithm for partitioning a graph into components with low (strong) 
diameter while cutting only a few edges in each of the k disjoint subsets of the input edges. The sequential 
version of this algorithm is at the heart of the low-stretch spanning tree algorithm of Alon, Karp, Peleg, and
West (AKPW) [AKPW95]. 

For context, notice that the outer layer of the AKPW algorithm (more details in Section 5) can be viewed as 
bucketing the input edges by weight, then partitioning and contracting them repeatedly. In this view, a number 
of edge classes are "reduced" simultaneously in an iteration. Further, as we wish to output a spanning subtree 
at the end, the components need to have low strong-diameter (i.e., one could not take "shortcuts" through other 
components). In the sequential case, the strong-diameter property is met by removing components one after 
another, but this process does not parallelize readily. For the parallel case, we guarantee this by growing balls 
from multiple sites, with appropriate "jitters" that conceptually delay when these ball-growing processes start, 
and assigning vertices to the first region that reaches them. These "jitter" terms are crucial in controlling the
probability that an edge goes across regions. But this probability also depends on the number of regions that
could reach such an edge. To keep this number small, we use a repeated sampling procedure motivated by
Cohen's $(\beta, W)$-cover construction [Coh93].

More concretely, we prove the following theorem: 

Theorem 4.1 (Parallel Low-Diameter Decomposition) Given an input graph $G = (V, E_1 \uplus \cdots \uplus E_k)$ with
k edge classes and a "radius" parameter $\rho$, the algorithm Partition($G$, $\rho$), upon termination, outputs a
partition of V into components $C = (C_1, C_2, \ldots, C_p)$, each with center $s_i$ such that

1. the center $s_i \in C_i$ for all $i \in [p]$,

2. for each i, every $u \in C_i$ satisfies $\mathrm{dist}_{G[C_i]}(s_i, u) \le \rho$, and

3. for all $j = 1, \ldots, k$, the number of edges in $E_j$ that go between components is at most $|E_j| \cdot \frac{c_1 k \log^3 n}{\rho}$,
where $c_1$ is an absolute constant.

Furthermore, Partition runs in $O(m \log^2 n)$ expected work and $O(\rho \log^2 n)$ expected depth.
4.1 Low-Diameter Decomposition for Simple Unweighted Graphs 

To prove this theorem, we begin by presenting an algorithm splitGraph that works with simple graphs with 
only one edge class and describe how to build on top of it an algorithm that handles multiple edge classes. 

The basic algorithm takes as input a simple, unweighted graph $G = (V, E)$ and a radius (in hop count)
parameter $\rho$ and outputs a partition of V into components $C_1, \ldots, C_p$, each with center $s_i$, such that

^5 As an aside, this construction of low-stretch ultra-sparse graphs shows how to obtain the $\tilde{O}(m)$-time linear system solver of
Spielman and Teng [ST06] without using their low-stretch spanning trees result [EEST05, ABN08].

(P1) Each center belongs to its own component. That is, the center $s_i \in C_i$ for all $i \in [p]$;

(P2) Every component has radius at most $\rho$. That is, for each $i \in [p]$, every $u \in C_i$ satisfies $\mathrm{dist}_{G[C_i]}(s_i, u) \le \rho$;

(P3) Given a technical condition (to be specified) that holds with probability at least 3/4, the probability that
an edge of the graph G goes between components is at most $\frac{136}{\rho} \log^3 n$.

In addition, this algorithm runs in $O(m \log^2 n)$ expected work and $O(\rho \log^2 n)$ expected depth. (These
properties should be compared with the guarantees in Theorem 4.1.)

Consider the pseudocode of this basic algorithm in Algorithm 4.1. The algorithm takes as input an
unweighted n-node graph G and proceeds in $T = O(\log n)$ iterations, with the eventual goal of outputting a
partition of the graph G into a collection of sets of nodes (each set of nodes is known as a component). Let
$G^{(t)} = (V^{(t)}, E^{(t)})$ denote the graph at the beginning of iteration t. Since this graph is unweighted, the distance
in this algorithm is always the hop-count distance $\mathrm{dist}(\cdot, \cdot)$. For iteration $t = 1, \ldots, T$, the algorithm picks a set
of starting centers $S^{(t)}$ to grow balls from; as with Cohen's $(\beta, W)$-cover, the number of centers is progressively
larger with iterations, reminiscent of the doubling trick (though with more careful handling of the growth rate),
to compensate for the balls' shrinking radius and to ensure that the graph is fully covered.

Still within iteration t, it chooses a random "jitter" value $\delta_s^{(t)} \in \{0, 1, \ldots, R\}$ for each of the centers
in $S^{(t)}$ and grows a ball from each center s out to radius $r^{(t)} - \delta_s^{(t)}$, where $r^{(t)} = \frac{\rho}{2 \log n}(T - t + 1)$. Let
$X^{(t)}$ be the union of these balls (i.e., the nodes "seen" from these starting points). In this process, the "jitter"
should be thought of as a random amount by which we delay the ball-growing process on each center, so
that we could assign nodes to the first region that reaches them while being in control of the number of
cross-component edges. Equivalently, our algorithm forms the components by assigning each vertex u reachable from
one of these centers to the center that minimizes $\mathrm{dist}_{G^{(t)}}(u, s) + \delta_s^{(t)}$ (ties broken in a consistent manner, e.g.,
lexicographically). Note that because of these "jitters," some centers might not be assigned any vertex, not even
themselves. For centers that are assigned some nodes, we include their components in the output, designating them as
the components' centers. Finally, we construct $G^{(t+1)}$ by removing nodes that were "seen" in this iteration (i.e.,
the nodes in $X^{(t)}$), because they are already part of one of the output components, and adjusting the edge set
accordingly.
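
The iteration just described can be sketched sequentially in Python as follows. This is an illustrative sketch under stated assumptions, not the paper's parallel implementation: logarithms are taken base 2, the jitter range R is rounded to an integer of at least 1, nodes are assumed to be mutually comparable (e.g., integers) so that ties can be broken lexicographically, and the name `split_graph` and the adjacency-dict input are hypothetical. The sampling rate follows Step 1 of Algorithm 4.1.

```python
import math
import random
from collections import deque


def split_graph(adj, rho, rng=random):
    """Sequential sketch of splitGraph. adj: {node: neighbours}. Returns {center: set}."""
    n = len(adj)
    log_n = max(1.0, math.log2(n))
    T = max(1, math.ceil(2 * log_n))
    R = max(1, int(rho / (2 * log_n)))  # integer jitter range (assumption)
    alive = set(adj)
    components = {}
    for t in range(1, T + 1):
        if not alive:
            break  # quit early once every node is assigned
        sigma = 12 * n ** (t / T - 1) * len(alive) * log_n
        pool = sorted(alive)
        centers = pool if len(pool) <= sigma else rng.sample(pool, math.ceil(sigma))
        radius = (T - t + 1) * R
        best = {}  # node -> (dist + jitter, center) of the first ball to reach it
        for s in centers:
            jitter = rng.randint(0, R)
            dist = {s: 0}
            queue = deque([s])
            while queue:  # BFS inside the surviving graph, out to radius - jitter
                u = queue.popleft()
                key = (dist[u] + jitter, s)
                if u not in best or key < best[u]:
                    best[u] = key
                if dist[u] < radius - jitter:
                    for v in adj[u]:
                        if v in alive and v not in dist:
                            dist[v] = dist[u] + 1
                            queue.append(v)
        for u, (_, s) in best.items():
            components.setdefault(s, set()).add(u)
        alive -= set(best)  # nodes seen in this iteration leave the graph
    return components


# Tiny usage example on a 6-node path with a fixed seed.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
print(split_graph(path, rho=6, rng=random.Random(1)))
```

In the last iteration the sampling rate reaches the whole surviving vertex set, so every node ends up in some component. For small n the constant 12 already saturates the sample in the first iteration, so the progressive sampling only becomes visible on large inputs.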

Analysis. Throughout this analysis, we make reference to various quantities in the algorithm and assume the 
reader's basic familiarity with our algorithm. We begin by proving properties (P1)-(P2). First, we state an 
easy-to-verify fact, which follows immediately by our choice of radius and components' centers. 

Fact 4.2 If vertex u lies in component $C_s^{(t)}$, then $\mathrm{dist}^{(t)}(s, u) \le r^{(t)}$. Moreover, $u \in B_s^{(t)}$.

We also need the following lemma to argue about strong diameter.
Lemma 4.3 If vertex $u \in C_s^{(t)}$, and vertex $v \in V^{(t)}$ lies on any u-s shortest path in $G^{(t)}$, then $v \in C_s^{(t)}$.

Proof: Since $u \in C_s^{(t)}$, Fact 4.2 implies u belongs to $B_s^{(t)}$. But $\mathrm{dist}^{(t)}(v, s) \le \mathrm{dist}^{(t)}(u, s)$, and hence v
belongs to $B_s^{(t)}$ and $X^{(t)}$ as well. This implies that v is assigned to some component $C_j^{(t)}$; we claim $j = s$.

For a contradiction, assume that $j \neq s$, and hence $\mathrm{dist}^{(t)}(v, j) + \delta_j^{(t)} \le \mathrm{dist}^{(t)}(v, s) + \delta_s^{(t)}$. In this case
$\mathrm{dist}^{(t)}(u, j) + \delta_j^{(t)} \le \mathrm{dist}^{(t)}(u, v) + \mathrm{dist}^{(t)}(v, j) + \delta_j^{(t)}$ (by the triangle inequality). Now using the assumption,
Algorithm 4.1 splitGraph($G = (V, E)$, $\rho$): Split an input graph $G = (V, E)$ into components of hop-radius
at most $\rho$.

Let $G^{(1)} = G$. Define $R = \rho/(2 \log n)$. Create empty collection of components C.

Use $\mathrm{dist}^{(t)}$ as shorthand for $\mathrm{dist}_{G^{(t)}}$, and define $B^{(t)}(u, r) := B_{G^{(t)}}(u, r) = \{v \in V^{(t)} \mid \mathrm{dist}^{(t)}(u, v) \le r\}$.

For $t = 1, 2, \ldots, T = 2 \log_2 n$,

1. Randomly sample $S^{(t)} \subseteq V^{(t)}$, where $|S^{(t)}| = \sigma_t = 12 n^{t/T - 1} |V^{(t)}| \log n$, or use $S^{(t)} = V^{(t)}$ if $|V^{(t)}| < \sigma_t$.

2. For each "center" $s \in S^{(t)}$, draw $\delta_s^{(t)}$ uniformly at random from $\mathbb{Z} \cap [0, R]$.

3. Let $r^{(t)} \leftarrow (T - t + 1) R$.

4. For each center $s \in S^{(t)}$, compute the ball $B_s^{(t)} = B^{(t)}(s, r^{(t)} - \delta_s^{(t)})$.

5. Let $X^{(t)} = \bigcup_{s \in S^{(t)}} B_s^{(t)}$.

6. Create components $\{C_s^{(t)} \mid s \in S^{(t)}\}$ by assigning each $u \in X^{(t)}$ to the component $C_s^{(t)}$ such that s minimizes
$\mathrm{dist}_{G^{(t)}}(u, s) + \delta_s^{(t)}$ (breaking ties lexicographically).

7. Add non-empty $C_s^{(t)}$ components to C.

8. Set $V^{(t+1)} \leftarrow V^{(t)} \setminus X^{(t)}$, and let $G^{(t+1)} \leftarrow G^{(t)}[V^{(t+1)}]$. Quit early if $V^{(t+1)}$ is empty.

Return C.

this expression is at most $\mathrm{dist}^{(t)}(u, v) + \mathrm{dist}^{(t)}(v, s) + \delta_s^{(t)} = \mathrm{dist}^{(t)}(u, s) + \delta_s^{(t)}$ (since v lies on the shortest
u-s path). But then, u would be also assigned to $C_j^{(t)}$, a contradiction. ■

Hence, for each non-empty component $C_s^{(t)}$, its center s lies within the component (since it lies on the
shortest path from s to any $u \in C_s^{(t)}$), which proves (P1). Moreover, by Fact 4.2 and Lemma 4.3, the (strong)
radius is at most $TR$, proving (P2). It now remains to prove (P3), and the work and depth bounds.

Lemma 4.4 For any vertex $u \in V$, with probability at least $1 - n^{-3}$, there are at most $68 \log^2 n$ pairs^6 $(s, t)$
such that $s \in S^{(t)}$ and $u \in B^{(t)}(s, r^{(t)})$.



We will prove this lemma in a series of claims. 
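Claim 4.5 For $t \in [T]$ and $v \in V^{(t)}$, if $|B^{(t)}(v, r^{(t+1)})| \ge n^{1 - t/T}$, then $v \in X^{(t)}$ w.p. at least $1 - n^{-12}$.

Proof: First, note that for any $s \in S^{(t)}$, $r^{(t)} - \delta_s \ge r^{(t)} - R = r^{(t+1)}$, and so if $s \in B^{(t)}(v, r^{(t+1)})$, then
$v \in B_s^{(t)}$ and hence in $X^{(t)}$. Therefore,

$$\Pr[v \in X^{(t)}] \ge \Pr[S^{(t)} \cap B^{(t)}(v, r^{(t+1)}) \neq \emptyset],$$

which is the probability that a random subset of $V^{(t)}$ of size $\sigma_t$ hits the ball $B^{(t)}(v, r^{(t+1)})$. But,

$$\Pr[S^{(t)} \cap B^{(t)}(v, r^{(t+1)}) \neq \emptyset] \ge 1 - \left(1 - \frac{|B^{(t)}(v, r^{(t+1)})|}{|V^{(t)}|}\right)^{\sigma_t},$$

which is at least $1 - n^{-12}$.

Claim 4.6 For $t \in [T]$ and $v \in V$, the number of $s \in S^{(t)}$ such that $v \in B^{(t)}(s, r^{(t)})$ is at most $34 \log n$ w.p. at
least $1 - n^{-4}$.

^6 In fact, for a given s, there is a unique t, if this s is ever chosen as a "starting point."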
Proof: For $t = 1$, the size $\sigma_1 = O(\log n)$ and hence the claim follows trivially. For $t \ge 2$, we condition on all
the choices made in rounds $1, 2, \ldots, t - 2$. Note that if v does not survive in $V^{(t-1)}$, then it does not belong to
$V^{(t)}$ either, and the claim is immediate. So, consider two cases, depending on the size of the ball $B^{(t-1)}(v, r^{(t)})$
in iteration $t - 1$:

- Case 1. If $|B^{(t-1)}(v, r^{(t)})| \ge n^{1 - (t-1)/T}$, then by Claim 4.5, with probability at least $1 - n^{-12}$, we have
$v \in X^{(t-1)}$, so v would not belong to $V^{(t)}$ and this means no $s \in S^{(t)}$ will satisfy $v \in B^{(t)}(s, r^{(t)})$, proving
the claim for this case.

- Case 2. Otherwise, $|B^{(t-1)}(v, r^{(t)})| < n^{1 - (t-1)/T}$. We have

$$|B^{(t)}(v, r^{(t)})| \le |B^{(t-1)}(v, r^{(t)})| < n^{1 - (t-1)/T},$$

as $B^{(t)}(v, r^{(t)}) \subseteq B^{(t-1)}(v, r^{(t)})$. Now let X be the number of s such that $v \in B^{(t)}(s, r^{(t)})$, so
$X = \sum_{s \in S^{(t)}} \mathbf{1}\{s \in B^{(t)}(v, r^{(t)})\}$. Over the random choice of $S^{(t)}$,

$$\Pr[s \in B^{(t)}(v, r^{(t)})] = \frac{|B^{(t)}(v, r^{(t)})|}{|V^{(t)}|} \le \frac{n^{1 - (t-1)/T}}{|V^{(t)}|},$$

which gives

$$E[X] = \sigma_t \cdot \Pr[s \in B^{(t)}(v, r^{(t)})] \le 17 \log n.$$

To obtain a high probability bound for X, we will apply the tail bound in Lemma 2.1. Note that X is simply
a hypergeometric random variable with the following parameter setting: total balls $N = |V^{(t)}|$, red balls
$M = |B^{(t)}(v, r^{(t)})|$, and the number of balls drawn is $\sigma_t$. Therefore, $\Pr[X \ge 34 \log n] \le \exp\{-\frac{1}{8} \cdot 34 \log n\}$, so
$X \le 34 \log n$ with probability at least $1 - n^{-4}$.

Hence, regardless of what choices we made in rounds $1, 2, \ldots, t - 2$, the conditional probability of seeing
more than $34 \log n$ different s's is at most $n^{-4}$. Hence, we can remove the conditioning, and the claim follows.



Lemma 4.7 If for each vertex $u \in V$, there are at most $68 \log^2 n$ pairs $(s, t)$ such that $s \in S^{(t)}$ and
$u \in B^{(t)}(s, r^{(t)})$, then for an edge uv, the probability that u belongs to a different component than v is at most
$68 \log^2 n / R$.

Proof: We define a center $s \in S^{(t)}$ as "separating" u and v if $|B_s^{(t)} \cap \{u, v\}| = 1$. Clearly, if u, v lie in different
components then there is some $t \in [T]$ and some center s that separates them. For a center $s \in S^{(t)}$, this can
happen only if $\delta_s = R - \mathrm{dist}(s, u)$, since $\mathrm{dist}(s, v) \le \mathrm{dist}(s, u) - 1$. As there are R possible values of $\delta_s$,
this event occurs with probability at most $1/R$. And since there are only $68 \log^2 n$ different centers s that can
possibly cut the edge, using a trivial union bound over them gives us an upper bound of $68 \log^2 n / R$ on the
probability. ■



To argue about (P3), notice that the premise to Lemma 4.7 holds with probability exceeding $1 - o(1) \ge 3/4$.
Combining this with Lemma 4.4 proves property (P3), where the technical condition is the premise to
Lemma 4.7.

Finally, we consider the work and depth of the algorithm. These are randomized bounds. Each computation
of $B^{(t)}(v, r^{(t)})$ can be done using a BFS. Since $r^{(t)} \le \rho$, the depth is bounded by $O(\rho \log n)$ per iteration,
resulting in $O(\rho \log^2 n)$ after $T = O(\log n)$ iterations. As for work, by Lemma 4.4, each vertex is reached by
at most $O(\log^2 n)$ starting points, yielding a total work of $O(m \log^2 n)$.
4.2 Low-Diameter Decomposition for Multiple Edge Classes

Extending the basic algorithm to support multiple edge classes is straightforward. The main idea is as follows.
Suppose we are given an unweighted graph $G = (V, E)$, and the edge set E is composed of k edge classes
$E_1 \uplus \cdots \uplus E_k$. So, if we run splitGraph on $G = (V, E)$ and $\rho$ treating the different classes as one, then
property (P3) indicates that each edge, regardless of which class it came from, is separated (i.e., it goes
across components) with probability at most $p = \frac{136}{\rho} \log^3 n$. This allows us to prove the following corollary, which
follows directly from Markov's inequality and the union bound.

Corollary 4.8 With probability at least 1/4, for all $i \in [k]$, the number of edges in $E_i$ that are between
components is at most $|E_i| \cdot \frac{272 k \log^3 n}{\rho}$.

The corollary suggests a simple way to use splitGraph to provide guarantees required by Theorem 4.1:
as summarized in Algorithm 4.2, we run splitGraph on the input graph treating all edge classes as one and
repeat it if any of the edge classes had too many edges cut (i.e., more than $|E_i| \cdot \frac{272 k \log^3 n}{\rho}$). As the corollary
indicates, the number of trials is a geometric random variable with $p = 1/4$, so in expectation, it will
finish after 4 trials. Furthermore, although it could go on forever in the worst case, the probability does fall
exponentially fast.

Algorithm 4.2 Partition($G = (V, E = E_1 \uplus \cdots \uplus E_k)$, $\rho$): Partition an input graph G into components
of radius at most $\rho$.

1. Let C = splitGraph($(V, \uplus_i E_i)$, $\rho$).

2. If there is some i such that $E_i$ has more than $|E_i| \cdot \frac{272 k \log^3 n}{\rho}$ edges between components, start over. (Recall
that k was the number of edge classes.)

Return C. 
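
The validate-and-retry loop of Algorithm 4.2 can be sketched in Python as below. The decomposition routine is passed in as a callable, `split_graph(nodes, edges, rho)`, returning a map from node to component label; this interface, the base-2 logarithm, and the returned trial count are assumptions made for the example rather than part of the paper.

```python
import math


def partition(nodes, edge_classes, rho, split_graph, c1=272):
    """Rerun `split_graph` until no edge class has too many edges cut.

    edge_classes: list of edge lists [(u, v), ...], one list per class.
    split_graph:  hypothetical callable (nodes, edges, rho) -> {node: component label}.
    Returns (labels, number_of_trials).
    """
    n, k = len(nodes), len(edge_classes)
    allowed = c1 * k * math.log2(n) ** 3 / rho  # allowed cut fraction per class
    all_edges = [e for cls in edge_classes for e in cls]  # treat all classes as one
    trials = 0
    while True:
        trials += 1
        label = split_graph(nodes, all_edges, rho)
        cut_ok = all(
            sum(label[u] != label[v] for u, v in cls) <= allowed * len(cls)
            for cls in edge_classes
        )
        if cut_ok:
            return label, trials
```

By Corollary 4.8 each trial succeeds with probability at least 1/4, so the expected number of trials of this loop is at most 4 when `split_graph` behaves like splitGraph.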

Finally, we note that properties (P1) and (P2) directly give Theorem 4.1(1)-(2), and the validation step
in Partition ensures Theorem 4.1(3), setting $c_1 = 272$. The work and depth bounds for Partition follow
from the bounds derived for splitGraph and Corollary 4.8. This concludes the proof of Theorem 4.1.



5 Parallel Low-Stretch Spanning Trees and Subgraphs 

This section presents parallel algorithms for low-stretch spanning trees and for low-stretch spanning subgraphs.
To obtain the low-stretch spanning tree algorithm, we apply the construction of Alon et al. [AKPW95]
(henceforth, the AKPW construction), together with the parallel graph partition algorithm from the previous section.
The resulting procedure, however, is not ideal for two reasons: the depth of the algorithm depends on the
"spread" $\Delta$ (the ratio between the heaviest edge and the lightest edge), and even for polynomial spread, both
the depth and the average stretch are super-logarithmic (both of them have a $2^{O(\sqrt{\log n \cdot \log\log n})}$ term).
Fortunately, for our application, we observe that we do not need spanning trees but merely low-stretch sparse graphs.
In Section 5.2, we describe modifications to this construction to obtain a parallel algorithm which computes
sparse subgraphs that give us only polylogarithmic average stretch and that can be computed in polylogarithmic
depth and $\tilde{O}(m)$ work. We believe that this construction may be of independent interest.
5.1 Low-Stretch Spanning Trees

Using the AKPW construction, along with the Partition procedure from Section 4, we will prove the following
theorem:

Theorem 5.1 (Low-Stretch Spanning Tree) There is an algorithm AKPW(G) which, given as input a graph
$G = (V, E, w)$, produces a spanning tree in $O(\log^{O(1)} n \cdot 2^{O(\sqrt{\log n \cdot \log\log n})} \log \Delta)$ expected depth and $\tilde{O}(m)$
expected work such that the total stretch of all edges is bounded by $m \cdot 2^{O(\sqrt{\log n \cdot \log\log n})}$.



Algorithm 5.1 AKPW($G = (V, E, w)$): a low-stretch spanning tree construction.

i. Normalize the edges so that $\min\{w(e) : e \in E\} = 1$.

ii. Let $y = 2^{\sqrt{6 \log n \cdot \log\log n}}$, $\tau = \lceil 3 \log(n) / \log y \rceil$, $z = 4 c_1 y \tau \log^3 n$. Initialize $T = \emptyset$.

iii. Divide E into $E_1, E_2, \ldots$, where $E_i = \{e \in E \mid w(e) \in [z^{i-1}, z^{i})\}$.

Let $E^{(1)} = E$ and $E_i^{(1)} = E_i$ for all i.

iv. For $j = 1, 2, \ldots$, until the graph is exhausted,

1. $(C_1, C_2, \ldots, C_p)$ = Partition($(V^{(j)}, \uplus_{i \le j} E_i^{(j)})$, $z/4$)

2. Add a BFS tree of each component to T.

3. Define graph $(V^{(j+1)}, E^{(j+1)})$ by contracting all edges within the components and removing all self-loops (but
maintaining parallel edges). Create $E_i^{(j+1)}$ from $E_i^{(j)}$ taking into account the contractions.

v. Output the tree T.
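
Steps i to iii of Algorithm 5.1 can be illustrated with the following Python sketch, which computes the parameters y, $\tau$, z and buckets the edges into weight classes. The base-2 logarithm, the default $c_1 = 272$ taken from Section 4.2, and the function names `akpw_parameters` and `bucket_edges` are assumptions of this example.

```python
import math


def akpw_parameters(n, c1=272):
    """Parameters y, tau, z of Algorithm 5.1 (base-2 logs assumed; requires n > 2)."""
    log_n = math.log2(n)
    y = 2 ** math.sqrt(6 * log_n * math.log2(log_n))
    tau = math.ceil(3 * log_n / math.log2(y))
    z = 4 * c1 * y * tau * log_n ** 3
    return y, tau, z


def bucket_edges(edges, z):
    """Normalize the minimum weight to 1 and put w into class i when w is in [z^(i-1), z^i)."""
    w_min = min(w for _, _, w in edges)
    classes = {}
    for u, v, w in edges:
        w_norm = w / w_min
        i, upper = 1, z
        while w_norm >= upper:  # advance until w_norm < z^i
            i, upper = i + 1, upper * z
        classes.setdefault(i, []).append((u, v, w_norm))
    return classes


y, tau, z = akpw_parameters(1024)
print(tau, y ** tau >= 1024 ** 3)  # tau is chosen so that y^tau >= n^3

# Bucketing with a small illustrative z (the real z is far larger).
demo = bucket_edges([(0, 1, 2.0), (1, 2, 10.0), (2, 3, 100.0), (3, 4, 1000.0)], z=10)
print(sorted(demo))
```

With these parameters, calling Partition with radius z/4 on at most $\tau$ classes cuts at most a $c_1 \tau \log^3 n / (z/4) = 1/y$ fraction of each class, which is the quantity used in the analysis that follows.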



Presented in Algorithm 5.1 is a restatement of the AKPW algorithm, except that here we will use our parallel
low-diameter decomposition for the partition step. In words, iteration j of Algorithm 5.1 looks at a graph
$(V^{(j)}, E^{(j)})$ which is a minor of the original graph (because components were contracted in previous iterations,
and because it only considers the edges in the first j weight classes). It uses Partition($(V^{(j)}, \uplus_{i \le j} E_i^{(j)})$, $z/4$) to
decompose this graph into components such that the hop radius is at most $z/4$ and each weight class has only
$1/y$ fraction of its edges crossing between components. (Parameters y, z are defined in the algorithm and are
slightly different from the original settings in the AKPW algorithm.) It then shrinks each of the components
into a single node (while adding a BFS tree on that component to T), and iterates on this graph. Adding these
BFS trees maintains the invariant that the set of original nodes which have been contracted into a (super-)node
in the current graph are connected in T; hence, when the algorithm stops, we have a spanning tree of the original
graph, hopefully of low total stretch.

We begin the analysis of the total stretch and running time by proving two useful facts:
Fact 5.2 The number of edges $|E_i^{(j)}|$ is at most $|E_i| / y^{j-i}$.

Proof: If we could ensure that the number of weight classes in play at any time is at most $\tau$, the number of
edges in each class would fall by at least a factor of $\frac{c_1 \tau \log^3 n}{z/4} = 1/y$ by Theorem 4.1(3) and the definition
of z, and this would prove the fact. Now, for the first $\tau$ iterations, the number of weight classes is at most $\tau$
just because we consider only the first j weight classes in iteration j. Now in iteration $\tau + 1$, the number of
surviving edges of $E_1$ would fall to $|E_1|/y^{\tau} \le |E_1|/n^3 < 1$, and hence there would only be $\tau$ weight classes
left. It is easy to see that this invariant can be maintained over the course of the algorithm. ■

Fact 5.3 In iteration j, the radius of a component according to edge weights (in the expanded-out graph) is at
most $z^{j+1}$.
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Proof: The proof is by induction on j. First, note that by Theorem 4.1(2), each of the clusters computed in 
any iteration j has edge-count radius at most 2;/4. Now the base case j = I follows by noting that each edge 
in El has weight less than z, giving a radius of at most < z^~^^. Now assume inductively that the radius 
in iteration j — 1 is at most . Now any path with z/A edges from the center to some node in the conti"acted 
graph will pass through at most z/A edges of weight at most z^, and at most z/A + l supernodes, each of which 
adds a distance of 2z^; hence, the new radius is at most z^~^^ /A + (z/4 + l)2z^ < z^~^^ as long as 2: > 8. ■ 

Applying these facts, we bound the total stretch of an edge class. 
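Lemma 5.4 For any $i \ge 1$, $\mathrm{str}_T(E_i) \le 4y^2 |E_i| (4c_1 \tau \log^3 n)^{\tau+1}$.

Proof: Let $e$ be an edge in $E_i$ contracted during iteration $j$. Since $e \in E_i$, we know $w(e) \ge z^{i-1}$. By Fact 5.3, the path connecting the two endpoints of $e$ in $T$ has distance at most $2z^{j+1}$. Thus, $\mathrm{str}_T(e) \le 2z^{j+1}/z^{i-1} = 2z^{j-i+2}$. Fact 5.2 indicates that the number of such edges is at most $|E_i^{(j)}| \le |E_i|/y^{j-i}$. We conclude that

$\mathrm{str}_T(E_i) \le \sum_{j=i}^{i+\tau-1} 2z^{j-i+2} |E_i|/y^{j-i} \le 4y^2 |E_i| (4c_1 \tau \log^3 n)^{\tau+1}$,

where the last step bounds the geometric sum with ratio $z/y = 4c_1 \tau \log^3 n \ge 2$ by twice its largest term, $2z^2 (z/y)^{\tau-1} |E_i|$. ■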



Proof: [of Theorem 5.1] Summing across the edge classes gives the promised bound on stretch. Now there are $\lceil \log_z \Delta \rceil$ weight classes $E_i$'s in all, and since each time the number of edges in a (non-empty) class drops by a factor of $y$, the algorithm has at most $O(\log \Delta + \tau)$ iterations. By Theorem 4.1 and standard techniques, each iteration does $O(m \log^2 n)$ work and has $O(z \log^2 n) = O(\log^{O(1)} n \cdot 2^{O(\sqrt{\log n \cdot \log\log n})})$ depth in expectation. ■



5.2 Low-Stretch Spanning Subgraphs 

We now show how to alter the parallel low-stretch spanning tree construction from the preceding section to give a low-stretch spanning subgraph construction that has no dependence on the "spread," and moreover has only polylogarithmic stretch. This comes at the cost of obtaining a sparse subgraph with $n - 1 + O(m/\mathrm{polylog}\, n)$ edges instead of a tree, but suffices for our solver application. The two main ideas behind these improvements are the following: Firstly, the number of surviving edges in each weight class decreases by a logarithmic factor in each iteration; hence, we could throw in all surviving edges after they have been whittled down in a constant number of iterations. This removes the factor of $2^{O(\sqrt{\log n \cdot \log\log n})}$ from both the average stretch and the depth.

Secondly, if $\Delta$ is large, we will identify certain weight classes with $O(m/\mathrm{polylog}\, n)$ edges, which by setting them aside, will allow us to break up the chain of dependencies and obtain $O(\mathrm{polylog}\, n)$ depth; these edges will be thrown back into the final solution, adding $O(m/\mathrm{polylog}\, n)$ extra edges (which we can tolerate) without increasing the average stretch.

5.2.1 The First Improvement 

Let us first show how to achieve polylogarithmic stretch with an ultra-sparse subgraph. Given parameters $\lambda \in \mathbb{Z}_{>0}$ and $\beta \ge c_2 \log^3 n$ (where $c_2 = 2 \cdot (4c_1(\lambda + 1))^{\frac{1}{2}(\lambda-1)}$), we obtain the new algorithm SparseAKPW($G$, $\lambda$, $\beta$) by modifying Algorithm 5.1 as follows:

(1) use the altered parameters $y = \frac{1}{c_2} \beta/\log^3 n$ and $z = 4c_1 y (\lambda + 1) \log^3 n$;

(2) in each iteration $j$, call Partition with at most $\lambda + 1$ edge classes: keep the $\lambda$ classes $E_j^{(j)}, E_{j-1}^{(j)}, \ldots, E_{j-\lambda+1}^{(j)}$, but then define a "generic bucket" $E_0^{(j)} := \bigcup_{j' \le j-\lambda} E_{j'}^{(j)}$ as the last part of the partition; and

(3) finally, output not just the tree $T$ but the subgraph $\hat{G} = T \cup (\bigcup_{i \ge 1} E_i^{(i+\lambda)})$.
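Step (2) can be illustrated with a small sketch. It is only a hedged illustration: the function names `weight_class` and `classes_for_iteration` are hypothetical, edges are assumed to be `(u, v, w)` triples with `w >= 1`, and the sketch buckets edges purely by weight class, ignoring which edges have already been contracted in earlier iterations.

```python
from collections import defaultdict

def weight_class(w, z):
    """Return the index i >= 1 with z**(i-1) <= w < z**i (assumes w >= 1)."""
    i, bound = 1, z
    while w >= bound:
        bound *= z
        i += 1
    return i

def classes_for_iteration(edges, z, j, lam):
    """Group edges for iteration j (hypothetical helper).

    Classes j, j-1, ..., j-lam+1 keep their own bucket; every lighter
    class is merged into the generic bucket, stored under key 0.
    Classes heavier than j are not yet in play and are skipped.
    """
    buckets = defaultdict(list)
    for (u, v, w) in edges:
        i = weight_class(w, z)
        if i > j:
            continue
        key = i if i > j - lam else 0
        buckets[key].append((u, v, w))
    return dict(buckets)
```

With at most `lam` individual keys plus the key 0, Partition sees at most $\lambda + 1$ classes regardless of how many weight classes the input has.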

Lemma 5.5 Given a graph $G$, parameters $\lambda \in \mathbb{Z}_{>0}$ and $\beta \ge c_2 \log^3 n$ (where $c_2 = 2 \cdot (4c_1(\lambda + 1))^{\frac{1}{2}(\lambda-1)}$), the algorithm SparseAKPW($G$, $\lambda$, $\beta$) outputs a subgraph of $G$ with at most $n - 1 + m(c_2 (\log^3 n/\beta))^{\lambda}$ edges and total stretch at most $m \beta^2 \log^{3\lambda+3} n$. Moreover, the expected work is $\tilde{O}(m)$ and expected depth is $O((c_1 \beta/c_2) \lambda \log^2 n (\log \Delta + \log n))$.



Proof: The proof parallels that for Theorem 5.1. Fact 5.3 remains unchanged. The claim from Fact 5.2 now remains true only for $j \in \{i, \ldots, i + \lambda - 1\}$; after that the edges in $E_i^{(j)}$ become part of $E_0^{(j)}$, and we only give a cumulative guarantee on the generic bucket. But this does not hurt us: if $e \in E_i$ is contracted in iteration $j \le i + \lambda - 1$ (i.e., it lies within a component formed in iteration $j$), then $\mathrm{str}_{\hat{G}}(e) \le 2z^{j-i+2}$. And the edges of $E_i$ that survive till iteration $j \ge i + \lambda$ have stretch 1 because they are eventually all added to $\hat{G}$; hence we do not have to worry that they belong to the class $E_0^{(j)}$ for those iterations. Thus,

$\mathrm{str}_{\hat{G}}(E_i) \le \sum_{j=i}^{i+\lambda-1} 2z^{j-i+2} \cdot |E_i|/y^{j-i} \le 4y^2 (z/y)^{\lambda+1} |E_i|$.

Summing across the edge classes gives $\mathrm{str}_{\hat{G}}(E) \le 4y^2 (z/y)^{\lambda+1} m$, which simplifies to $O(m \beta^2 \log^{3\lambda+3} n)$. Next, the number of edges in the output follows directly from the fact that $T$ can have at most $n - 1$ edges, and the number of extra edges from each class is only a $1/y^{\lambda}$ fraction (i.e., $|E_i^{(i+\lambda)}| \le |E_i|/y^{\lambda}$ from Fact 5.2). Finally, the work remains the same; for each of the $(\log \Delta + \tau)$ distance scales the depth is still $O(z \log^2 n)$, but the new value of $z$ causes this to become $O((c_1 \beta/c_2) \lambda \log^2 n)$. ■



5.2.2 The Second Improvement 

The depth of the SparseAKPW algorithm still depends on $\log \Delta$, and the reason is straightforward: the graph $G^{(j)}$ used in iteration $j$ is built by taking $G^{(1)}$ and contracting edges in each iteration; hence, it depends on all previous iterations. However, the crucial observation is that if we had $\tau$ consecutive weight classes $E_i$'s which are empty, we could break this chain of dependencies at this point. There may be no empty weight classes, but having weight classes with relatively few edges is enough, as we show next.

Fact 5.6 Given a graph $G = (V, E)$ and a subset of edges $F \subseteq E$, let $G' = G \setminus F$ be a potentially disconnected graph. If $\hat{G}'$ is a subgraph of $G'$ with total stretch $\mathrm{str}_{\hat{G}'}(E(G')) \le D$, then the total stretch of $E$ on $\hat{G} := \hat{G}' \cup F$ is at most $|F| + D$.

Consider a graph $G = (V, E, w)$ with edge weights $w(e) \ge 1$, and let $E_i(G) := \{e \in E(G) \mid w(e) \in [z^{i-1}, z^{i})\}$ be the weight classes. Then, $G$ is called $(\gamma, \tau)$-well-spaced if there is a set of special weight classes $\{E_i(G)\}_{i \in I}$ such that for each $i \in I$, (a) there are at most $\gamma$ weight classes before the following special weight class $\min\{i' \in I \cup \{\infty\} \mid i' > i\}$, and (b) the $\tau$ weight classes $E_{i-1}(G), E_{i-2}(G), \ldots, E_{i-\tau}(G)$ preceding $i$ are all empty.

Lemma 5.7 Given any graph $G = (V, E)$, $\tau \in \mathbb{Z}_{+}$, and $\theta \le 1$, there exists a graph $G' = (V, E')$ which is $(4\tau/\theta, \tau)$-well-spaced, and $|E \setminus E'| \le \theta \cdot |E|$. Moreover, $G'$ can be constructed in $O(m)$ work and $O(\log n)$ depth.

Proof: Let $\delta = \lceil \log_z \Delta \rceil$; note that the edge classes for $G$ are $E_1, \ldots, E_\delta$, some of which may be empty. Denote by $E_J$ the union $\bigcup_{i \in J} E_i$. We construct $G'$ as follows: Divide these edge classes into disjoint groups $J_1, J_2, \ldots \subseteq [\delta]$, where each group consists of $\lceil \tau/\theta \rceil$ consecutive classes. Within a group $J_i$, by an averaging argument, there must be a range $L_i \subseteq J_i$ of $\tau$ consecutive edge classes that contains at most a $\theta$ fraction of all the edges in this group, i.e., $|E_{L_i}| \le \theta \cdot |E_{J_i}|$ and $|L_i| \ge \tau$. We form $G'$ by removing the edges in all these groups $L_i$'s from $G$, i.e., $G' = (V, E \setminus (\bigcup_i E_{L_i}))$. This removes only a $\theta$ fraction of all the edges of the graph.

We claim $G'$ is $(4\tau/\theta, \tau)$-well-spaced. Indeed, if we remove the group $L_i$, then we designate the smallest $j \in [\delta]$ such that $j > \max\{j' \in L_i\}$ as a special bucket (if such a $j$ exists). Since we removed the edges in $E_{L_i}$, the second condition for being well-spaced follows. Moreover, the number of buckets between a special bucket and the following one is at most

$2\lceil \tau/\theta \rceil - (\tau - 1) \le 4\tau/\theta$.

Finally, these computations can be done in $O(m)$ work and $O(\log n)$ depth using standard techniques [JaJ92, Lei92]. ■
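The averaging step in this proof can be sketched sequentially as follows. This is a minimal illustration under stated assumptions, not the parallel procedure: `well_space` is a hypothetical name, the input is just the list of class sizes $|E_1|, \ldots, |E_\delta|$, an incomplete trailing group is left untouched, and the $\theta$-fraction guarantee is only claimed here when $1/\theta$ is an integer (so that each group contains $1/\theta$ disjoint windows of $\tau$ classes).

```python
import math

def well_space(class_sizes, tau, theta):
    """Return 0-based indices of the weight classes to empty (hypothetical helper).

    class_sizes[k] is the number of edges in class k+1. Classes are split
    into groups of ceil(tau/theta) consecutive classes; in each full group
    the window of tau consecutive classes holding the fewest edges is removed.
    """
    group_len = math.ceil(tau / theta)
    removed = []
    for start in range(0, len(class_sizes), group_len):
        group = class_sizes[start:start + group_len]
        if len(group) < group_len:
            break  # trailing partial group: left untouched in this sketch
        # Sliding window of tau consecutive classes with the smallest edge count.
        best = min(range(group_len - tau + 1),
                   key=lambda off: sum(group[off:off + tau]))
        removed.extend(range(start + best, start + best + tau))
    return removed
```

The class immediately after each removed window then plays the role of a special bucket, since the $\tau$ classes preceding it are empty.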



Lemma 5.8 Let $\tau = 3 \log n/\log y$. Given a graph $G$ which is $(\gamma, \tau)$-well-spaced, SparseAKPW can be computed on $G$ with $\tilde{O}(m)$ work and $O(\frac{c_1}{c_2} \gamma \lambda \beta \log^2 n)$ depth.

Proof: Since $G$ is $(\gamma, \tau)$-well-spaced, each special bucket $i \in I$ must be preceded by $\tau$ empty buckets. Hence, in iteration $i$ of SparseAKPW, any surviving edges belong to buckets $E_{i-\tau}$ or smaller. However, these edges have been reduced by a factor of $y$ in each iteration and since $\tau > \log_y n^2$, all the edges have been contracted in previous iterations, i.e., $E_{i'}^{(i)}$ for $i' < i$ is empty.

Consider any special bucket $i$: we claim that we can construct the vertex set $V^{(i)}$ that SparseAKPW sees at the beginning of iteration $i$, without having to run the previous iterations. Indeed, we can just take the MST on the entire graph $G = G^{(1)}$, retain only the edges from buckets $E_{i-\tau}$ and lower, and contract the connected components of this forest to get $V^{(i)}$. And once we know this vertex set $V^{(i)}$, we can drop out the edges from $E_i$ and higher buckets which have been contracted (these are now self-loops), and execute iterations $i, i + 1, \ldots$ of SparseAKPW without waiting for the preceding iterations to finish. Moreover, given the MST, all this can be done in $O(m)$ work and $O(\log n)$ depth.

Finally, for each special bucket $i$ in parallel, we start running SparseAKPW at iteration $i$. Since there are at most $\gamma$ iterations until the next special bucket, the total depth is only $O(\gamma z \log^2 n) = O(\frac{c_1}{c_2} \gamma \lambda \beta \log^2 n)$. ■
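The construction of the vertex set for a special bucket can be sketched with a union-find structure. This is a sequential illustration only, and it swaps in a simpler technique than the one in the proof: instead of computing an MST and restricting it to light edges, it contracts all light edges directly, which yields the same connected components. The name `contract_low_buckets` and the `(u, v, w)` edge format are assumptions of this sketch.

```python
def contract_low_buckets(n, edges, z, i, tau):
    """Label vertices 0..n-1 by supernode after contracting light edges.

    Every edge in a weight class <= i - tau, i.e. with weight below
    z**(i - tau), is contracted. Two vertices receive the same label
    exactly when such edges connect them.
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    threshold = z ** (i - tau)  # class k holds weights in [z**(k-1), z**k)
    for (u, v, w) in edges:
        if w < threshold:
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
    return [find(x) for x in range(n)]
```

Edges whose endpoints share a label are self-loops in the contracted graph and can be dropped before iteration $i$ starts.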

Theorem 5.9 (Low-Stretch Subgraphs) Given a weighted graph $G$, $\lambda \in \mathbb{Z}_{>0}$, and $\beta \ge c_2 \log^3 n$ (where $c_2 = 2 \cdot (4c_1(\lambda + 1))^{\frac{1}{2}(\lambda-1)}$), there is an algorithm LSSubgraph($G$, $\beta$, $\lambda$) that finds a subgraph $\hat{G}$ such that

1. $|E(\hat{G})| \le n - 1 + m (c_{LS} \frac{\log^3 n}{\beta})^{\lambda}$

2. The total stretch (of all $E(G)$ edges) in the subgraph $\hat{G}$ is at most $m \beta^2 \log^{3\lambda+3} n$,

where $c_{LS}$ ($= c_2 + 1$) is a constant. Moreover, the procedure runs in $\tilde{O}(m)$ work and $O(\lambda \beta^{\lambda+1} \log^{3-3\lambda} n)$ depth. If $\lambda = O(1)$ and $\beta = \mathrm{polylog}(n)$, the depth term simplifies to $O(\log^{O(1)} n)$.

Proof: Given a graph $G$, we set $\tau = 3 \log n/\log y$ and $\theta = (\log^3 n/\beta)^{\lambda}$, and apply Lemma 5.7 to delete at most $\theta m$ edges, and get a $(4\tau/\theta, \tau)$-well-spaced graph $G'$. Let $m' = |E'|$. On this graph, we run SparseAKPW to obtain a graph $\hat{G}'$ with $n - 1 + m' (c_2 (\log^3 n/\beta))^{\lambda}$ edges and total stretch at most $m' \beta^2 \log^{3\lambda+3} n$; moreover, Lemma 5.8 shows this can be computed with $\tilde{O}(m)$ work and the depth is

$O(\frac{c_1}{c_2} (4\tau/\theta) \lambda \beta \log^2 n) = O(\lambda \beta^{\lambda+1} \log^{3-3\lambda} n)$.

Finally, we output the graph $\hat{G} = \hat{G}' \cup (E(G) \setminus E(G'))$; this gives the desired bounds on stretch and the number of edges as implied by Fact 5.6 and Lemma 5.5. ■



6 Parallel SDD Solver 

In this section, we derive a parallel solver for symmetric diagonally dominant (SDD) linear systems, using the ingredients developed in the previous sections. The solver follows closely the line of work of [ST03, ST06, KM07, KMP10]. Specifically, we will derive a proof for the main theorem (Theorem 1.1), the statement of which is reproduced below.

Theorem 1.1. For any fixed $\theta > 0$ and any $\epsilon > 0$, there is an algorithm SDDSolve that on input an SDD matrix $A$ and a vector $b$ computes a vector $\tilde{x}$ such that $\|\tilde{x} - A^{+}b\|_A \le \epsilon \cdot \|A^{+}b\|_A$ in $O(m \log^{O(1)} n \log \frac{1}{\epsilon})$ work and $O(m^{1/3+\theta} \log \frac{1}{\epsilon})$ depth.

In proving this theorem, we will focus on Laplacian linear systems. As noted earlier, linear systems on SDD matrices are reducible to systems on graph Laplacians in $O(\log(m + n))$ depth and $O(m + n)$ work [Gre96]. Furthermore, because of the one-to-one correspondence between graphs and their Laplacians, we will use the two terms interchangeably.

The core of the near-linear time Laplacian solvers in [ST03, ST06, KMP10] is a "preconditioning" chain of progressively smaller graphs $\{A_1 = A, A_2, \ldots, A_d\}$, along with a well-understood recursive algorithm, known as the recursive preconditioned Chebyshev method (rPCh), that traverses the levels of the chain and for each visit at level $i < d$, performs $O(1)$ matrix-vector multiplications with $A_i$ and other simple vector-vector operations. Each time the algorithm reaches level $d$, it solves a linear system on $A_d$ using a direct method. Except for solving the bottom-level systems, all these operations can be accomplished in linear work and $O(\log(m + n))$ depth. The recursion itself is based on a simple scheme; for each visit at level $i$ the algorithm makes at most a fixed, system-independent number (at least 2) of recursive calls to level $i + 1$. Therefore, assuming we have computed a chain of preconditioners, the total required depth is (up to a log) equal to the total number of times the algorithm reaches the last (and smallest) level $A_d$.

6.1 Parallel Construction of Solver Chain



The construction of the preconditioning chain in [KMP10] relies on a subroutine that on input a graph $A_i$, constructs a slightly sparser graph $B_i$ which is spectrally related to $A_i$. This "incremental sparsification" routine is in turn based on the computation of a low-stretch tree for $A_i$. The parallelization of the low-stretch tree is actually the main obstacle in parallelizing the whole solver presented in [KMP10]. Crucial to effectively applying our result in Section 5 is a simple observation that the sparsification routine of [KMP10] only requires a low-stretch spanning subgraph rather than a tree. Then, with the exception of some parameters in its construction, the preconditioning chain remains essentially the same.

The following lemma is immediate from Section 6 of [KMP10].

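Lemma 6.1 Given a graph $G$ and a subgraph $\hat{G}$ of $G$ such that the total stretch of all edges in $G$ with respect to $\hat{G}$ is $m \cdot S$, a parameter on condition number $\kappa$, and a success probability $1 - 1/\xi$, there is an algorithm that constructs a graph $H$ such that

1. $G \preceq H \preceq \kappa \cdot G$, and

2. $|E(H)| = |E(\hat{G})| + (c_{IS} \cdot S \log n \log \xi)/\kappa$

in $O(\log^2 n)$ depth and $O(m \log^2 n)$ work, where $c_{IS}$ is an absolute constant.

Although Lemma 6.1 was originally stated with $\hat{G}$ being a spanning tree, the proof in fact works without changes for an arbitrary subgraph. For our purposes, $\xi$ has to be at most $O(\log n)$ and that introduces an additional $O(\log\log n)$ term. For simplicity, in the rest of the section, we will consider this as an extra $\log n$ factor.

Lemma 6.2 Given a graph $G$ and parameters $\lambda$ and $\eta$ [...], we can construct a graph $H$ such that

1. $G \preceq H \preceq \frac{1}{10} \log^{\eta\lambda} n \cdot G$, and

2. $|E(H)| \le n - 1 + m \cdot c_{PC}/\log^{\eta\lambda - 2\eta - 4\lambda} n$,

where $c_{PC}$ is an absolute constant.

Proof: Let $\hat{G}$ = LSSubgraph($G$, $\lambda$, $\log^{\eta} n$), that is, $\beta = \log^{\eta} n$. Then, Theorem 5.9 shows that $|E(\hat{G})|$ is at most

$n - 1 + m (c_{LS} \cdot \frac{\log^3 n}{\beta})^{\lambda} = n - 1 + m (\frac{c_{LS}}{\log^{\eta-3} n})^{\lambda}$.

Furthermore, the total stretch of all edges in $G$ with respect to $\hat{G}$ is at most

$m \beta^2 \log^{3\lambda+3} n = m \log^{2\eta+3\lambda+3} n$.

Applying Lemma 6.1 with $\kappa = \frac{1}{10} \log^{\eta\lambda} n$ gives $H$ such that $G \preceq H \preceq \frac{1}{10} \log^{\eta\lambda} n \cdot G$ and $|E(H)|$ is at most

$n - 1 + m \cdot c_{PC}/\log^{\eta\lambda - 2\eta - 3\lambda - 5} n \le n - 1 + m \cdot c_{PC}/\log^{\eta\lambda - 2\eta - 4\lambda} n$. ■

We now give a more precise definition of the preconditioning chain we use for the parallel solver by giving the pseudocode for constructing it.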

Definition 6.3 (Preconditioning Chain) Consider a chain of graphs

$C = \{A_1 = A, B_1, A_2, \ldots, A_d\}$,

and denote by $n_i$ and $m_i$ the number of nodes and edges of $A_i$ respectively. We say that $C$ is a preconditioning chain for $A$ if

1. $B_i$ = IncrementalSparsify($A_i$).

2. $A_{i+1}$ = GreedyElimination($B_i$).

3. $A_i \preceq B_i \preceq 1/10 \cdot \kappa_i A_i$, for some explicitly known integer $\kappa_i$.¹

As noted above, the rPCh algorithm relies on finding the solution of linear systems on $A_d$, the bottom-level systems. To parallelize these solves, we make use of the following fact which can be found in Sections 3.4 and 4.2 of [GVL96].

Fact 6.4 A factorization $LL^{T}$ of the pseudo-inverse of an $n$-by-$n$ Laplacian $A$, where $L$ is a lower triangular matrix, can be computed in $O(n)$ time and $O(n^3)$ work, and any solves thereafter can be done in $O(\log n)$ time and $O(n^2)$ work.

Note that although $A$ is not positive definite, its null space is the space spanned by the all 1s vector when the underlying graph is connected. Therefore, we can in turn drop the first row and column to obtain a positive definite matrix on which LU factorization is numerically stable.
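The remark about dropping the first row and column can be made concrete with a small sequential sketch. It is an illustration under stated assumptions, not the dense-inverse routine of Fact 6.4: the helper names `cholesky` and `solve_grounded` are hypothetical, the Laplacian is a list of lists for a connected graph, and the right-hand side is assumed to sum to zero.

```python
import math

def cholesky(M):
    """Return lower-triangular C with M = C C^T (M assumed positive definite)."""
    n = len(M)
    C = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            s = M[r][c] - sum(C[r][k] * C[c][k] for k in range(c))
            C[r][c] = math.sqrt(s) if r == c else s / C[c][c]
    return C

def solve_grounded(L, b):
    """Solve L x = b for a connected-graph Laplacian L with sum(b) == 0.

    Vertex 0 is 'grounded': its row and column are dropped, the remaining
    positive definite system is solved by Cholesky, and x[0] is set to 0.
    """
    Lg = [row[1:] for row in L[1:]]
    C = cholesky(Lg)
    n = len(Lg)
    y = [0.0] * n
    for r in range(n):  # forward substitution: C y = b[1:]
        y[r] = (b[r + 1] - sum(C[r][k] * y[k] for k in range(r))) / C[r][r]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution: C^T x = y
        x[r] = (y[r] - sum(C[k][r] * x[k] for k in range(r + 1, n))) / C[r][r]
    return [0.0] + x
```

Because the rows of a Laplacian sum to zero and `b` sums to zero, the equation for the dropped row is satisfied automatically; any other solution differs from this one by a multiple of the all 1s vector.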

The routine GreedyElimination is a partial Cholesky factorization (for details see [ST06] or [KMP10]) on vertices of degree at most 2. From a graph-theoretic point of view, the routine GreedyElimination can be viewed as simply recursively removing nodes of degree one and splicing out nodes of degree two. The sequential version of GreedyElimination returns a graph with no degree 1 or 2 nodes. The parallel version that we present below leaves some degree-2 nodes in the graph, but their number will be small enough to not affect the complexity.

Lemma 6.5 If $G$ has $n$ vertices and $n - 1 + m$ edges, then the procedure GreedyElimination($G$) returns a graph with at most $2m - 2$ nodes in $O(n + m)$ work and $O(\log n)$ depth whp.

Proof: The sequential version of GreedyElimination($G$) is equivalent to repeatedly removing degree 1 vertices and splicing out degree 2 vertices until no more exist while maintaining self-loops and multiple edges (see, e.g., [ST03, ST06] and [Kou07, Section 2.3.4]). Thus, the problem is a slight generalization of parallel tree contraction [MR89]. In the parallel version, we show that while the graph has more than $2m - 2$ nodes, we can efficiently find and eliminate a "large" independent set of degree two nodes, in addition to all degree one vertices.

We alternate between two steps, which are equivalent to Rake and Compress in [MR89], until the vertex count is at most $2m - 2$:

Mark an independent set of degree 2 vertices, then

1. Contract all degree 1 vertices, and

2. Compress and/or contract out the marked vertices.

¹ The constant of 1/10 in the condition number is introduced only to simplify subsequent notation.

To find the independent set, we use a randomized marking algorithm on the degree two vertices (this is used in place of maximal independent set for work efficiency): Each degree two node flips a coin with probability $\frac{1}{3}$ of turning up heads; we mark a node if it is a heads and its neighbors either did not flip a coin or flipped a tail.
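One round of the randomized marking can be sketched as follows. This is a sequential illustration of a step that would run in parallel over the vertices; the function name `mark_independent_degree_two`, the adjacency-list input, and the default heads probability `p = 1/3` are assumptions of the sketch.

```python
import random

def mark_independent_degree_two(adj, p=1/3, rng=random):
    """Return a set of marked degree-2 vertices, no two of them adjacent.

    adj maps each vertex to the list of its neighbours (with multiplicity).
    Only degree-2 vertices flip a coin; a vertex is marked when it flips
    heads and none of its neighbours flipped heads.
    """
    deg2 = {v for v, nb in adj.items() if len(nb) == 2}
    heads = {v for v in deg2 if rng.random() < p}
    return {v for v in heads if all(u not in heads for u in adj[v])}
```

A vertex whose two neighbours also have degree 2 is marked with probability $p(1-p)^2$, which is maximized at $p = 1/3$ and then equals $4/27$; vertices with fewer coin-flipping neighbours are marked with higher probability.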

We show that the two steps above will remove a constant fraction of "extra" vertices. Let $G$ be a multigraph with $n$ vertices and $m + n - 1$ edges. First, observe that if all vertices have degree at least three then $n \le 2(m - 1)$ and we would be finished. So, let $T$ be any fixed spanning tree of $G$; let $a_1$ (resp. $a_2$) be the number of vertices in $T$ of degree one (resp. two) and $a_3$ the number of those of degree three or more. Similarly, let $b_1$, $b_2$, and $b_3$ be the number of vertices in $G$ of degree 1, 2, and at least 3, respectively, where the degree is the vertex's degree in $G$.

It is easy to check that in expectation, these two steps remove $b_1 + \frac{4}{27} b_2 \ge b_1 + \frac{1}{7} b_2$ vertices. In the following, we will show that $b_1 + \frac{1}{7} b_2 \ge \frac{1}{7} \Delta n$, where $\Delta n = n - (2m - 2) = n - 2m + 2$ denotes the number of "extra" vertices in the graph. Consider non-tree edges and how they are attached to the tree $T$. Let $m_1$, $m_2$, and $m_3$ be the number of attachments of the following types, respectively:

(1) an attachment to $x$, a degree 1 vertex in $T$, where $x$ has at least one other attachment.

(2) an attachment to $x$, a degree 1 vertex in $T$, where $x$ has no other attachment.

(3) an attachment to a degree 2 vertex in $T$.

As each edge is incident on two endpoints, we have $m_1 + m_2 + m_3 \le 2m$. Also, we can lower bound $b_1$ and $b_2$ in terms of the $m_i$'s and $a_i$'s: we have $b_1 \ge a_1 - m_1/2 - m_2$ and $b_2 \ge m_2 + a_2 - m_3$. This gives

$b_1 + \frac{1}{7} b_2 \ge \frac{2}{7}(a_1 - m_1/2 - m_2) + \frac{1}{7}(m_2 + a_2 - m_3)$
$= \frac{2}{7} a_1 + \frac{1}{7} a_2 - \frac{1}{7}(m_1 + m_2 + m_3)$
$\ge \frac{2}{7} a_1 + \frac{1}{7} a_2 - \frac{2}{7} m$.

Consequently, $b_1 + \frac{1}{7} b_2 \ge \frac{1}{7}(2a_1 + a_2 - 2m) \ge \frac{1}{7} \cdot \Delta n$, where to show the last step, it suffices to show that $n + 2 \le 2a_1 + a_2$ for a tree $T$ of $n$ nodes. WLOG, we may assume that all nodes of $T$ have degree either one or three, in which case $2a_1 = n + 2$. Finally, by Chernoff bounds, the algorithm will finish with high probability in $O(\log n)$ rounds. ■
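The tree inequality $n + 2 \le 2a_1 + a_2$ used in the last step can be checked empirically. The sketch below is only a sanity check on randomly generated trees with at least two nodes, not a proof; the helper name `degree_counts` and the parent-array encoding of a tree are assumptions of the sketch.

```python
import random

def degree_counts(parents):
    """Return (a1, a2): numbers of degree-1 and degree-2 vertices of a tree.

    parents[v] is the parent of vertex v for v >= 1; parents[0] is unused
    because vertex 0 is the root.
    """
    n = len(parents)
    deg = [0] * n
    for v in range(1, n):
        deg[v] += 1
        deg[parents[v]] += 1
    return sum(1 for d in deg if d == 1), sum(1 for d in deg if d == 2)

def random_tree(n, rng):
    """Random tree on n vertices: each vertex v >= 1 picks a parent below v."""
    return [None] + [rng.randrange(v) for v in range(1, n)]
```

The inequality follows from the degree sum of a tree: the leaves must compensate for every vertex of degree three or more, so $a_1 \ge a_3 + 2$ and hence $n = a_1 + a_2 + a_3 \le 2a_1 + a_2 - 2$.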



6.2 Parallel Performance of Solver Chain 

Spielman and Teng [ST06, Section 5] gave a (sequential) time bound for solving a linear SDD system given a preconditioner chain. The following lemma extends their Theorem 5.5 to give parallel runtime bounds (work and depth), as a function of $\kappa_i$'s and $m_i$'s. We note that in the bounds below, the $m_d^2$ term arises from the dense inverse used to solve the linear system in the bottom level.

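Lemma 6.6 There is an algorithm that given a preconditioner chain $C = \{A_1 = A, A_2, \ldots, A_d\}$ for a matrix $A$, a vector $b$, and an error tolerance $\epsilon$, computes a vector $\tilde{x}$ such that

$\|\tilde{x} - A^{+}b\|_A \le \epsilon \cdot \|A^{+}b\|_A$,

with depth bounded by

$\left(\sum_{1 \le i \le d} \prod_{1 \le j < i} \sqrt{\kappa_j}\right) \log n \log(\frac{1}{\epsilon}) \le O\left(\left(\prod_{1 \le j < d} \sqrt{\kappa_j}\right) \log n \log(\frac{1}{\epsilon})\right)$

and work bounded by

$\left(\sum_{1 \le i \le d-1} m_i \cdot \prod_{j \le i} \sqrt{\kappa_j} + m_d^2 \prod_{1 \le j < d} \sqrt{\kappa_j}\right) \log(\frac{1}{\epsilon})$.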

To reason about Lemma 6.6, we will rely on the following lemma about preconditioned Chebyshev iteration and the recursive solves that happen at each level of the chain. This lemma is a restatement of Spielman and Teng's Lemma 5.3 (slightly modified so that the $\sqrt{\kappa_i}$ term does not involve a constant, which shows up instead as a constant in the preconditioner chain's definition).

Lemma 6.7 Given a preconditioner chain of length $d$, it is possible to construct linear operators $\mathrm{solve}_{A_i}$ for all $i \le d$ such that

$(1 - e^{-2}) A_i^{+} \preceq \mathrm{solve}_{A_i} \preceq (1 + e^{2}) A_i^{+}$

and $\mathrm{solve}_{A_i}$ is a polynomial of degree $\sqrt{\kappa_i}$ involving $\mathrm{solve}_{A_{i+1}}$ and 4 matrices with $m_i$ non-zero entries (from GreedyElimination).

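Armed with this, we state and prove the following lemma:

Lemma 6.8 For $\ell \ge 1$, given any vector $b$, the vector $\mathrm{solve}_{A_\ell} \cdot b$ can be computed in depth

$\log n \sum_{\ell \le i \le d} \prod_{\ell \le j < i} \sqrt{\kappa_j}$

and work

$\sum_{\ell \le i \le d-1} m_i \cdot \prod_{\ell \le j \le i} \sqrt{\kappa_j} + m_d^2 \prod_{\ell \le j < d} \sqrt{\kappa_j}$.

Proof: The proof is by induction in decreasing order on $\ell$. When $d = \ell$, all we are doing is a matrix multiplication with a dense inverse. This takes $O(\log n)$ depth and $O(m_d^2)$ work.

Suppose the result is true for $\ell + 1$. Then $\mathrm{solve}_{A_\ell}$ can be expressed as a polynomial of degree $\sqrt{\kappa_\ell}$ involving an operator that is $\mathrm{solve}_{A_{\ell+1}}$ multiplied by at most 4 matrices with $O(m_\ell)$ non-zero entries. We have that the total depth is

$\log n + \sqrt{\kappa_\ell} \cdot \left(\log n \sum_{\ell+1 \le i \le d} \prod_{\ell+1 \le j < i} \sqrt{\kappa_j}\right) = \log n \sum_{\ell \le i \le d} \prod_{\ell \le j < i} \sqrt{\kappa_j}$

and the total work is bounded by

$\sqrt{\kappa_\ell} \left(m_\ell + \sum_{\ell+1 \le i \le d-1} m_i \cdot \prod_{\ell+1 \le j \le i} \sqrt{\kappa_j} + m_d^2 \prod_{\ell+1 \le j < d} \sqrt{\kappa_j}\right) = \sum_{\ell \le i \le d-1} m_i \cdot \prod_{\ell \le j \le i} \sqrt{\kappa_j} + m_d^2 \prod_{\ell \le j < d} \sqrt{\kappa_j}$. ■

Proof: [of Lemma 6.6] The $\epsilon$-accuracy bound follows from applying preconditioned Chebyshev to $\mathrm{solve}_{A_1}$ similarly to Spielman and Teng's Theorem 5.5 [ST06], and the running time bounds follow from Lemma 6.8 when $\ell = 1$. ■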
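The induction behind Lemma 6.8 can be mirrored by a short recursion, which also makes it easy to evaluate the bounds for a concrete chain. This is a hedged sketch with constants and the $\log n$ factor dropped: the names `chain_costs` and `closed_form` are hypothetical, levels are 0-based, `kappas[i]` and `ms[i]` stand for $\kappa_{i+1}$ and $m_{i+1}$, and the last level is charged $m_d^2$ for the dense solve.

```python
import math

def chain_costs(kappas, ms, level=0):
    """Return (rounds, work) for one application of solve at `level`.

    rounds counts sequential matrix-vector rounds (depth without the
    log n factor); work counts non-zeros touched.
    """
    d = len(ms) - 1
    if level == d:
        return 1, ms[d] ** 2  # dense inverse at the bottom level
    rounds, work = chain_costs(kappas, ms, level + 1)
    t = math.sqrt(kappas[level])  # polynomial degree at this level
    return 1 + t * rounds, t * (ms[level] + work)

def closed_form(kappas, ms):
    """Evaluate the sums and products of Lemma 6.8 for level 0."""
    d = len(ms) - 1
    sq = [math.sqrt(k) for k in kappas]
    rounds = sum(math.prod(sq[:i]) for i in range(d + 1))
    work = sum(ms[i] * math.prod(sq[:i + 1]) for i in range(d))
    work += ms[d] ** 2 * math.prod(sq[:d])
    return rounds, work
```

For example, with `kappas = [4, 9, 16]` and `ms = [1000, 100, 10, 3]`, both functions give 33 rounds and work 3056; the bottom-level term $m_d^2 \prod \sqrt{\kappa_j}$ contributes 216 of that, which is why the choice of where to stop the chain matters.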



6.3 Optimizing the Chain for Depth 

Lemma 6.6 shows that the algorithm's performance is determined by the settings of $\kappa_i$'s and $m_i$'s; however, as we will be using Lemma 6.2, the number of edges $m_i$ is essentially dictated by our choice of $\kappa_i$. We now show that if we terminate the chain earlier, i.e., adjusting the dimension of $A_d$ to roughly $O(m^{1/3} \log \epsilon^{-1})$, we can obtain good parallel performance. As a first attempt, we will set the $\kappa_i$'s uniformly:

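Lemma 6.9 For any fixed $\theta > 0$, if we construct a preconditioner chain using Lemma 6.2 setting $\lambda$ to some proper constant greater than 21, $\eta = \lambda$ and extending the sequence until $m_d \le m^{1/3-\delta}$ for some $\delta$ depending on $\lambda$, we get a solver algorithm that runs in $O(m^{1/3+\theta} \log(1/\epsilon))$ depth and $\tilde{O}(m \log 1/\epsilon)$ work as $\lambda \to \infty$, where $\epsilon$ is the accuracy precision of the solution, as defined in the statement of Theorem 1.1.

Proof: By Lemma 6.2, we have that $m_{i+1}$, the number of edges in level $i + 1$, is bounded by

$O(m_i \cdot c_{PC}/\log^{\eta\lambda - 2\eta - 4\lambda} n) = O(m_i \cdot c_{PC}/\log^{\lambda(\lambda-6)} n)$,

which can be repeatedly applied to give

$m_i \le m \cdot O(c_{PC}/\log^{\lambda(\lambda-6)} n)^{i-1}$.

Therefore, when $\lambda > 12$, we have that for each $i < d$,

$m_i \cdot \prod_{j \le i} \sqrt{\kappa_j} \le \tilde{O}(m) \cdot O(c_{PC}/\log^{\lambda(\lambda-12)/2} n)^{i-1} \le \tilde{O}(m)$.

Now consider the term involving $m_d$. We have that $d$ is bounded by

$(\frac{2}{3} + \delta) \log m/\log(\frac{1}{c_{PC}} \log^{\lambda(\lambda-6)} n)$.

Combining with the $\kappa_i = \log^{\lambda^2} n$, we get

$\prod_{1 \le i \le d} \sqrt{\kappa_i} = (\log n)^{\frac{\lambda^2}{2} (\frac{2}{3} + \delta) \log m/\log(\frac{1}{c_{PC}} \log^{\lambda(\lambda-6)} n)}$
$= \exp\left(\log\log n \cdot \frac{\lambda^2}{2} (\frac{2}{3} + \delta) \frac{\log m}{\lambda(\lambda-6) \log\log n - \log c_{PC}}\right)$
$\le \exp\left(\log\log n \cdot \frac{\lambda^2}{2} (\frac{2}{3} + \delta) \frac{\log m}{\lambda(\lambda-7) \log\log n}\right)$ (since $-\log c_{PC} \ge -\log\log n$)
$= \exp\left(\log m \cdot (\frac{1}{3} + \frac{\delta}{2}) \frac{\lambda}{\lambda-7}\right)$
$= O(m^{(\frac{1}{3} + \frac{\delta}{2}) \frac{\lambda}{\lambda-7}})$.

Since $m_d = O(m^{\frac{1}{3} - \delta})$, the total work is bounded by

$O(m^{(\frac{1}{3} + \frac{\delta}{2}) \frac{\lambda}{\lambda-7} + \frac{2}{3} - 2\delta}) \le O(m^{1 + \frac{7}{\lambda-7} - \delta \frac{\lambda-14}{\lambda-7}})$.

So, setting $\delta \ge \frac{7}{\lambda-14}$ suffices to bound the total work by $\tilde{O}(m)$. And, when $\delta$ is set to $\frac{7}{\lambda-14}$, the total parallel running time is bounded by the number of times the last layer is called:

$\prod_{j} \sqrt{\kappa_j} \le O(m^{(\frac{1}{3} + \frac{7}{2(\lambda-14)}) \frac{\lambda}{\lambda-7}})$
$\le O(m^{\frac{1}{3} + \frac{7}{\lambda-7} + \frac{7\lambda}{2(\lambda-14)(\lambda-7)}})$
$\le O(m^{\frac{1}{3} + \frac{14}{\lambda-14}})$ when $\lambda \ge 21$.

Setting $\lambda$ arbitrarily large suffices to give $O(m^{1/3+\theta})$ depth. ■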
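To see how slowly the depth exponent approaches $1/3$, one can evaluate it numerically. The sketch below assumes the exponent $(\frac{1}{3} + \frac{\delta}{2}) \frac{\lambda}{\lambda-7}$ for the number of bottom-level calls, with $\delta = \frac{7}{\lambda-14}$ as in the proof; the function name `depth_exponent` is hypothetical and the formula is a reading of the derivation rather than a statement taken verbatim from it.

```python
def depth_exponent(lam):
    """Exponent of m in the bound on bottom-level calls, for lam > 14.

    Assumes delta = 7 / (lam - 14), the smallest value for which the
    work bound in the proof stays near-linear.
    """
    delta = 7 / (lam - 14)
    return (1 / 3 + delta / 2) * lam / (lam - 7)

if __name__ == "__main__":
    for lam in (21, 100, 1000, 10**6):
        print(lam, depth_exponent(lam))
```

Under this reading the exponent is well above $1/3$ for small $\lambda$ and only approaches it as $\lambda$ grows, which is consistent with the statement that $\lambda$ must be taken arbitrarily large to reach $m^{1/3+\theta}$ for a given $\theta$.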

To match the promised bounds in Theorem 1.1, we improve the performance by reducing the exponent on the $\log n$ term in the total work from $\lambda^2$ to some large fixed constant while letting total depth still approach $O(m^{1/3+\theta})$.

Proof: [of Theorem 1.1] Consider setting $\lambda = 13$ and $\eta \ge \lambda$. Then,

$\eta\lambda - 2\eta - 4\lambda \ge \eta(\lambda - 6) \ge \frac{7}{13} \eta\lambda$.

We use $c_4$ to denote this constant of $\frac{7}{13}$, namely $c_4$ satisfies

$c_{PC}/\log^{\eta\lambda - 2\eta - 4\lambda} n \le c_{PC}/\log^{c_4 \eta\lambda} n$.

We can then pick a constant threshold $L$ and set $\kappa_i$ for all $i \le L$ as follows:

$\kappa_1 = \log^{\lambda^2} n$, $\kappa_2 = \log^{(2c_4)\lambda^2} n$, $\ldots$, $\kappa_i = \log^{(2c_4)^{i-1}\lambda^2} n$.

To solve $A_L$, we apply Lemma 6.9, which is analogous to setting $A_L, \ldots, A_d$ uniformly. The depth required in constructing these preconditioners is $O(m_d + \sum_{j=1}^{L} (2c_4)^{j-1} \lambda^2)$, plus $O(m_d)$ for computing the inverse at the last level, for a total of $O(m_d) = O(m^{1/3})$.

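As for work, the total work is bounded by

$\sum_{i < d} m_i \prod_{1 \le j \le i} \sqrt{\kappa_j} + m_d^2 \prod_{1 \le j < d} \sqrt{\kappa_j}$
$= \sum_{i < L} m_i \prod_{1 \le j \le i} \sqrt{\kappa_j} + \left(\prod_{1 \le j < L} \sqrt{\kappa_j}\right) \left(\sum_{L \le i < d} m_i \prod_{L \le j \le i} \sqrt{\kappa_j} + m_d^2 \prod_{L \le j < d} \sqrt{\kappa_j}\right)$
$\le \sum_{i < L} m_i \prod_{1 \le j \le i} \sqrt{\kappa_j} + \left(\prod_{1 \le j < L} \sqrt{\kappa_j}\right) m_L \sqrt{\kappa_L}$
$= \sum_{i \le L} m_i \prod_{1 \le j \le i} \sqrt{\kappa_j}$
$\le \sum_{i \le L} m \sqrt{\kappa_1} \prod_{2 \le j \le i} O\left(\frac{\sqrt{\kappa_j}}{\kappa_{j-1}^{c_4}}\right)$
$= \tilde{O}(m)$.

The first inequality follows from the fact that the exponent of $\log n$ in $\kappa_L$ can be arbitrarily large, and then applying Lemma 6.9 to the solves after level $L$. The fact that $m_{j+1} \le m_j \cdot O(1/\kappa_j^{c_4})$ follows from Lemma 6.2. The last step uses $\sqrt{\kappa_j} = \kappa_{j-1}^{c_4}$ for $2 \le j \le L$ together with the fact that $L$ is a constant.

Since $L$ is a constant, $\prod_{1 \le j \le L} \sqrt{\kappa_j} \in O(\mathrm{polylog}\, n)$, so the total depth is still bounded by $O(m^{1/3+\theta})$ by Lemma 6.9. ■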



7 Conclusion 



We presented a near linear-work parallel algorithm for constructing graph decompositions with strong-diameter guarantees and parallel algorithms for constructing $2^{O(\sqrt{\log n \log\log n})}$-stretch spanning trees and $O(\log^{O(1)} n)$-stretch ultra-sparse subgraphs. The ultra-sparse subgraphs were shown to be useful in the design of a near linear-work parallel SDD solver. By plugging our result into previous frameworks, we obtained improved parallel algorithms for several problems on graphs.

We leave open the design of a (near) linear-work parallel algorithm for the construction of a low-stretch tree with polylogarithmic stretch. We also feel that the design of a (near) work-efficient $O(\log^{O(1)} n)$-depth SDD solver is a very interesting problem that will probably require the development of new techniques.